ABSTRACT

Personnel detection at border crossings has become an important issue
recently. To reduce the number of false alarms, it is important to
discriminate between humans and four-legged animals. This paper proposes
using enhanced summary autocorrelation patterns for feature extraction
from seismic sensors, a multi-stage exemplar selection framework to learn
an acoustic classifier, and temporal patterns from ultrasonic sensors. We
compare decision fusion with Gaussian Mixture Model classifiers against
feature fusion with Support Vector Machines. Experimental results show
that our proposed methods improve the robustness of the system.

Keywords: Gaussian Mixture Models, Support Vector Machines, sensor fusion,
footstep detection, personnel detection

1.  INTRODUCTION 

Personnel detection is an important task for Intelligence, Surveillance,
and Reconnaissance (ISR) [1, 2]. One might like to detect intruders in a
certain area during the day and night so that the proper authorities can
be alerted. For example, border crimes including human trafficking would
be reduced by automatic detection of illegal aliens crossing the border.
There are numerous other applications where personnel detection is
important.

However, personnel detection is a challenging problem. Video sensors
consume high amounts of power and require a large volume for storage.
Hence, it is preferable to use non-imaging sensors, since they tend to use
low amounts of power and are long-lasting. Non-imaging sensors, however,
suffer from ambiguity among the footsteps of animals alone, humans alone,
and animals traveling together with humans.

Traditionally, personnel detection research concentrated on using seismic
sensors. When a person walks, his/her impact on the ground causes seismic
vibrations, which are captured by the seismic sensors. Previous works have
relied on fundamental gait frequency estimation [3, 4]. Park et al.
proposed extracting temporal gait patterns to provide information on the
temporal distribution of gait beats [5].


At border crossings, animals such as mules, horses, or donkeys are often
known to carry loads. Animal hoof sounds are distinct from human footstep
sounds. When humans and four-legged animals walk together, the sounds they
make are perceptually distinguishable by human listeners. Automatic
algorithms that imitate human capabilities in other acoustic event
detection tasks have been constructed [6, 7, 8], e.g., using perceptual
linear prediction (PLP) features coupled to tandem neural network-HMM
recognizers.

Passive and active ultrasonic methods have been proposed for detecting
walking personnel [9]. The passive method utilizes the ultrasonic signals
generated by the friction forces of footsteps, while the active method
uses the human Doppler ultrasonic signature. In an outdoor scene, passive
ultrasound signals are limited in distance and are noisy. For the active
ultrasound method, when a person walks, each limb acts as a compound
pendulum with distinct oscillatory characteristics, which in turn results
in a micro-Doppler effect. Similarly, the torso also oscillates at a
particular frequency. The ultrasonic sensors can detect the ultrasonic
signature generated by footsteps and movements of the torso. Zhang et al.
reported that micro-Doppler gait signatures differ between humans and
four-legged animals [10]. These differences arise from the different
physical mechanisms found in the different species. Kalgaonkar et al.
analyzed spectral patterns to classify human walking (walker
identification, approach vs. withdraw, male vs. female) [11].

As shown in the above literature review, existing research uses only a
single sensor, recorded in clean environments with a single object (a
person or a four-legged animal) walking. In reality, however, when many
objects such as people or four-legged animals walk or run in noisy
environments, it is difficult to distinguish humans alone vs. animals
alone vs. animals and humans together using a single sensor and published
approaches.

In this paper, we propose using enhanced summary autocorrelation patterns
for feature extraction from seismic sensors, a multi-stage exemplar
selection framework to learn an acoustic classifier, and temporal patterns
from ultrasonic sensors. Acoustic, seismic, and ultrasound signals are
fused using decision fusion based on Gaussian Mixture Models (GMMs) and
feature fusion based on Support Vector Machines in order to examine the
robustness of our methods.

Fig. 1. Sensor layout, where a multi-sensor multi-modal system has
acoustic, seismic, passive infra-red (PIR), radar, magnetic, and electric
field sensors.

The organization of this paper is as follows: Section 2 introduces the
multi-sensor multi-modality data and events. Section 3 discusses the
feature extraction from seismic, acoustic, and ultrasonic sensors. Section
4 discusses Gaussian mixture model classifiers, decision fusion, and
Support Vector Machines. Section 5 describes the experiments on the
multi-sensor multi-modal dataset. We conclude this paper with future work
in Section 6.

2.  DATA 

In this paper, we use a multi-sensor multi-modal realistic dataset
collected in Arizona by the U.S. Army Research Lab and the University of
Mississippi. The data are collected in a realistic environment in an open
field. There are three selected vantage points in the area, all known to
be used by illegal aliens crossing the border: (a) a wash (a flash flood
river bed with fine-grain sand), (b) a trail (a path through the shrubs
and bushes), and (c) a choke point (a valley between two hills). The data
are recorded using several sensor modalities, namely, acoustic, seismic,
passive infrared (PIR), magnetic, E-field, passive ultrasonic, sonar, and
both infrared and visible video sensors. The sensor suites are placed
along the path 40 to 60 meters apart. The detailed layout of the sensors
is shown in Figure 1. Test subjects walked or ran along the path and
returned along the same path.

A total of 26 scenarios with various combinations of people, animals, and
payload are enacted. We can categorize them as: single person (11.6%), two
people (13%), three people (21.7%), one person with one animal (14.5%),
two people with two animals (15.9%), three people with three animals
(17.4%), and seven people with a dog (5.9%), where the animals can be a
mule, a donkey, a horse, or a dog, and the number in parentheses is the
percentage of the data. The data are collected over a period of four days,
each day at a different site and in a different environment. There is
variable wind in the recording environment.


Fig. 2. The overall flow: feature extraction based on phenomenology, GMM
and SVM classifiers, and decision and feature fusion.


2.1.  Active  Sensing 

The time during which subjects pass by is short (about ten to twenty
seconds at a time) compared to the whole recording time (five to six
minutes). Without any ground truth segmentation, we would like to extract
the time intervals when test subjects are passing through. This problem
can be formulated as an example of active sensing and learning [12, 13],
which refers to sequential data selection and inference procedures that
actively seek out highly informative data, rather than relying solely on
non-adaptive data acquisition.

For acoustic sensors, in an outdoor scene, the signals are contaminated by
wind sounds, human voices, or unexpected airplane engine sounds. Seismic
and PIR sensors, on the other hand, are relatively clean. Hence, we can
apply energy detection to the seismic or PIR signals to determine the time
intervals when test subjects pass by. If the energy in any ten-second
interval exceeds a threshold, the interval is marked "active." Seismic and
acoustic signals are pre-synchronized; therefore the acoustic active
interval can be marked on the basis of seismic energy. Ultrasound is not
tightly synchronized; therefore it must be independently segmented. For
each recording, there are two active segments (subjects walked or ran
along the path and returned along the same path). In this paper, we
emphasize the classification of segmented multimodal recordings into two
classes: humans only, and humans with (four-legged) animals.
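The energy-detection step can be illustrated with a minimal sketch. It assumes non-overlapping ten-second intervals and energy computed as a sum of squared samples; the function name `active_intervals` and the threshold value are hypothetical and are not taken from the paper.

```python
def active_intervals(samples, sample_rate, threshold, interval_s=10.0):
    """Return (start_s, end_s) for each interval whose energy exceeds threshold.

    Assumption: intervals are non-overlapping and energy is the sum of
    squared samples; the paper does not specify either detail.
    """
    n = int(interval_s * sample_rate)  # samples per interval
    active = []
    for start in range(0, len(samples) - n + 1, n):
        energy = sum(x * x for x in samples[start:start + n])
        if energy > threshold:
            active.append((start / sample_rate, (start + n) / sample_rate))
    return active


# Toy signal at 10 Hz: 10 s of silence, 10 s of activity, 10 s of silence.
signal = [0.0] * 100 + [1.0] * 100 + [0.0] * 100
intervals = active_intervals(signal, sample_rate=10, threshold=50.0)
```

Because the seismic and acoustic channels are pre-synchronized, the intervals found on the seismic channel could be reused directly to cut the acoustic recording.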

3. FEATURE EXTRACTION

Features  are  extracted  from  seismic,  acoustic,  and  ultrasonic 
sensors.  The  overall  flow  is  shown  in  Figure  2. 

3.1.  Seismic 

Seismic sensors capture the vibrations in the ground caused by the motion
of the targets or by ground coupling of acoustic waves. The gait patterns
of humans and four-legged animals differ. Previous approaches do not
consider the case of multiple humans and/or four-legged animals [3, 5]. In
that case, it is not reliable to estimate the gait period with a
single-pitch (fundamental frequency) detection method [14, 15]. Inspired
by Park's temporal gait pattern approach [5] and the progress in
multipitch analysis [16], we propose a gait pattern feature extraction
method based on enhanced summary autocorrelation [16], as shown in Figure
3. A typical example of the enhanced summary autocorrelation function is
shown in Figure 4, where the same subjects generate similar enhanced
summary autocorrelation patterns. We form analytic signals by the Hilbert
transform and then use full-wave rectification followed by low-pass
filtering and down-sampling for envelope detection. Finally, we use
enhanced summary autocorrelation to estimate the gait pattern and generate
a 12-dimensional feature vector using 12 triangular windows.

Fig. 3. Seismic feature extraction algorithm.

The idea of enhanced summary autocorrelation is to prune the periodicity
of the autocorrelation function. The procedure is the following: First,
from the envelope signals, the autocorrelation function is computed within
each channel (2 channels in the model of Tolonen and Karjalainen [16]).
Second, the autocorrelation functions are summed across the channels to
form a summary autocorrelation function. Third, the summary
autocorrelation function is clipped to positive values, then time-scaled
by a factor of two, and subtracted from the original clipped function.
Then, the same procedure is repeated with other integer factors so that
repetitive peaks at integer multiples can be removed. The resulting
function is called the enhanced summary autocorrelation.
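The pruning procedure above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the linear interpolation used for time-scaling, the re-clipping after each subtraction, and the choice of factors (2, 3) are assumptions, and all function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Unnormalized autocorrelation of x for lags 0 .. max_lag - 1."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(max_lag)]


def stretch(values, factor):
    """Time-scale by an integer factor: out[lag] = values[lag / factor],
    using linear interpolation between neighboring lags (an assumption)."""
    out = []
    for lag in range(len(values)):
        pos = lag / factor
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * values[lo] + frac * values[hi])
    return out


def enhanced_summary_autocorrelation(channels, max_lag, factors=(2, 3)):
    """Sum per-channel autocorrelations, clip to positive values, then
    subtract time-scaled copies to remove peaks at integer multiples."""
    summary = [0.0] * max_lag
    for envelope in channels:
        acf = autocorrelation(envelope, max_lag)
        summary = [s + a for s, a in zip(summary, acf)]
    enhanced = [max(v, 0.0) for v in summary]
    for factor in factors:
        stretched = stretch(enhanced, factor)
        # Re-clipping after subtraction is an assumption of this sketch.
        enhanced = [max(e - s, 0.0) for e, s in zip(enhanced, stretched)]
    return enhanced


# Toy envelope: one impulse every 5 samples (a single "gait period" of 5).
envelope = [1.0 if t % 5 == 0 else 0.0 for t in range(50)]
esacf = enhanced_summary_autocorrelation([envelope], max_lag=20)
```

On this toy input the plain autocorrelation has peaks at lags 5, 10, and 15, whereas the enhanced function keeps only the peak at lag 5, which is the behavior the text describes for repetitive peaks at integer multiples.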

3.2.  Acoustic 

In acoustic signals, the hoof sounds of animals such as horses, donkeys,
or mules are perceptually distinct from human footstep sounds. In order to
imitate the perceptual discrimination abilities of human listeners, we
begin by using Perceptual Linear Predictive (PLP) features [17], which are
common features in speech recognition. As mentioned in Section 2, the data
are recorded in an open field, and there are noisy wind sounds in the
recordings. We use spectral subtraction to reduce the effect of noise
[18, 19].

Fig. 4. Examples of enhanced summary autocorrelation of seismic signals.
The left column (panel title: One person) shows examples of the feature
vector for one person, and the right column (panel title: Three People
with Three Four-legged Animals) shows examples for three people with three
four-legged animals, at three different time frames.

From the active segments we extracted in Section 2.1, we further extract
acoustic features from short-time footstep sounds by incorporating seismic
signals. Since there are no labels for the exact time of footstep sounds,
we have to use the seismic sensor information, assuming that the peaks in
the seismic signals correspond to footsteps. Suppose there are $n$ groups
of peaks (if some peaks are close to each other, we count them as one
group) in the seismic signal, whose times are $t_i$, for $i = 1, \dots, n$.
We choose a small time $\delta$ around the peaks and extract PLP features
within the time interval $(t_i - \delta, t_i + \delta)$, for
$i = 1, \dots, n$, as shown in Figure 5. In each time period, we extract
13 PLP features using 186 ms Hamming windows with 75% overlap, where
186 ms is approximately equal to the duration of a single footstep (from
heel strike to toe slap). Delta and delta-delta coefficients are appended
to create a 39-dimensional feature vector.
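As a small illustration of the windowing described above, the sketch below groups nearby seismic peaks and returns the intervals $(t_i - \delta, t_i + \delta)$. The grouping rule (a maximum gap between consecutive peaks) and the use of the group mean as $t_i$ are assumptions, since the paper does not state how close peaks are merged; the names `group_peaks`, `footstep_windows`, and `frame_hop_ms` are hypothetical.

```python
def group_peaks(peak_times, max_gap):
    """Merge peaks closer than max_gap seconds into one group and return
    one representative time per group (assumed here to be the mean)."""
    groups = []
    for t in sorted(peak_times):
        if groups and t - groups[-1][-1] <= max_gap:
            groups[-1].append(t)
        else:
            groups.append([t])
    return [sum(g) / len(g) for g in groups]


def footstep_windows(peak_times, delta, max_gap):
    """Return (t_i - delta, t_i + delta) for every peak group."""
    return [(t - delta, t + delta) for t in group_peaks(peak_times, max_gap)]


def frame_hop_ms(window_ms, overlap):
    """Hop between successive analysis windows for a given overlap ratio."""
    return window_ms * (1.0 - overlap)


windows = footstep_windows([1.0, 1.05, 2.0], delta=0.3, max_gap=0.1)
hop = frame_hop_ms(186, 0.75)  # 186 ms windows with 75% overlap
```

With the stated 186 ms window and 75% overlap, successive PLP frames start 46.5 ms apart.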
Our goal is to classify humans only vs. humans with animals. In the humans
with animals class, there are instances of human footstep sounds.
Therefore, there is some overlap between the two classes in the feature
space, as shown on the left-hand side of Figure 6. Regularized
discriminative methods such as support vector machines (SVMs) explicitly
trade off the degree of class overlap against the complexity of the
decision boundary in order to minimize an estimate of expected risk.
Generative models, on the other hand, model overlap only to the extent
permitted by the specified generative model.

Fig. 5. Using peaks of seismic signals for matching acoustic footstep
sounds.

Fig. 6. Left: an example of feature space of humans only and humans with
animals class. Right: an example of feature space of humans only and
estimated animals only class, after exemplar selection.

Fig. 7. Multi-stage framework for acoustic exemplar selection.

In order to improve the classifiers' ability to compensate for class
overlap, therefore, we propose a multi-stage algorithm for exemplar
selection, as shown in Figure 7; this framework is similar to the
"self-training" methods used in semi-supervised learning. The idea of the
framework is to select the exemplar frames in the humans with animals
class which are dissimilar to the features in the humans only class. With
the exemplar selection method, it is easier for classifiers to learn the
distinctive features between classes, as shown on the right-hand side of
Figure 6. The algorithm is as follows:

1. Train an exemplar selection classifier (SVM or GMM) for humans only and
humans with animals using training data, as shown in the left block of
Figure 7.

2. Label the training data of the humans with animals class using the
trained models, as shown in the middle block of Figure 7. Each frame in
the training data is labeled as either the humans only class or the humans
with animals class.

3. Keep the frames which were labeled as humans with animals; in other
words, discard the frames which were labeled as humans only.

4. Train a new classifier (SVM or GMM) between the estimated animals only
class and the humans only class, as shown in the right block of Figure 7.
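The four steps can be sketched in a few lines. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM or GMM used in the paper; this substitution, the fallback when no frame is kept, and every function name here are assumptions made only for illustration.

```python
def centroid(frames):
    """Mean vector of a list of equal-length feature frames."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(human_frames, other_frames):
    """Stand-in classifier: one centroid per class (not an SVM or GMM)."""
    return centroid(human_frames), centroid(other_frames)


def predict(model, frame):
    human_c, other_c = model
    closer_to_human = sq_dist(frame, human_c) <= sq_dist(frame, other_c)
    return "human" if closer_to_human else "other"


def select_exemplars(human_frames, mixed_frames):
    # Step 1: train humans only vs. humans with animals.
    stage1 = train(human_frames, mixed_frames)
    # Steps 2-3: relabel the mixed-class frames, keep those not labeled human.
    kept = [f for f in mixed_frames if predict(stage1, f) == "other"]
    if not kept:
        kept = list(mixed_frames)  # assumption: fall back to all frames
    # Step 4: retrain humans only vs. the estimated animals only class.
    stage2 = train(human_frames, kept)
    return kept, stage2


humans = [[0.0], [0.2], [-0.2]]
mixed = [[0.1], [5.0], [5.2]]  # one human-like frame, two hoof-like frames
kept, final_model = select_exemplars(humans, mixed)
```

In this toy case the human-like frame `[0.1]` is discarded from the mixed class, so the second-stage model is trained on better separated data, which mirrors the right-hand side of Figure 6.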

Note that the acoustic features capture short-time footstep sounds, while
the seismic and ultrasonic features utilize temporal pattern information.
Therefore, the multi-stage exemplar selection framework applies to
acoustic features only.

3.3.  Ultrasound 

Ultrasonic sensors, also known as acoustic Doppler sensors [9], emit
acoustic waves toward objects and receive reflected responses from
objects. Benefits of using ultrasonic sensors include low cost ($5 USD in
2011) and low power. The limitation is that, because of the rapid
attenuation of high-frequency acoustic waves, ultrasonic sensors have a
limited range on the order of ten meters.

By measuring the frequency shift of a wave scattered or radiated by a
moving object, the velocity of the object relative to an observer can be
calculated; this is known as the Doppler effect. If the object contains
moving parts, each moving part will result in a modulation of the base
Doppler frequency shift, which is known as the micro-Doppler effect. Given
an acoustic wave transmitted by an observer, the frequency of the wave
received from a single point scatterer is

$$ f = f_0 \left(1 + \frac{v}{c}\right) \tag{1} $$

where $f_0$ is the frequency of the transmitted acoustic wave, $v$ is the
velocity of the scatterer relative to the observer, and $c$ is the speed
of sound. The Doppler frequency shift, $\Delta f = f_0 \frac{v}{c}$, is
proportional to the velocity of the scatterer relative to the observer.
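A short numerical example may help fix the order of magnitude of the shift in Equation (1). The speed of sound (343 m/s, roughly air at 20 °C) and the 1.5 m/s walking speed are illustrative assumptions, not values reported in the paper; the function names are hypothetical.

```python
def received_frequency(f0, v, c=343.0):
    """Equation (1): frequency received from a single point scatterer.

    f0: transmitted frequency in Hz
    v:  scatterer velocity relative to the observer in m/s
    c:  speed of sound in m/s (assumed 343 m/s)
    """
    return f0 * (1.0 + v / c)


def doppler_shift(f0, v, c=343.0):
    """Doppler frequency shift, delta_f = f0 * v / c."""
    return f0 * v / c


# A 40 kHz carrier and a body part moving at 1.5 m/s toward the sensor.
shift_hz = doppler_shift(40000.0, 1.5)
f_received = received_frequency(40000.0, 1.5)
```

Under these assumptions the shift is about 175 Hz; limbs moving faster or slower than the torso spread the reflected energy around this value, which is the micro-Doppler signature the features aim to capture.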

A human body is an articulated object, comprising a number of rigid bones
connected by joints. When a continuous tone is incident on an animal or a
walking person, the reflected signal contains a spectrum of frequencies
produced by the Doppler shifts of the carrier tone, because of the
velocities of the various moving body parts.

As reported in Zhang et al. [10], based on different physical walking
mechanisms, the micro-Doppler gait signatures of a person and a
four-legged animal are different. We use this concept to extract features
in order to distinguish between humans and four-legged animals.

For ultrasound signal processing, given the data with two channels, 25 kHz
and 40 kHz, we first use a band-pass filter with stopband edges at 20 kHz
and 30 kHz and passband edges at 22.5 kHz and 27.5 kHz for the 25 kHz
channel, and a band-pass filter with stopband edges at 30 kHz and 45 kHz
and passband edges at 37.5 kHz and 42.5 kHz for the 40 kHz channel. Then,
we use the Hilbert transform to demodulate the captured Doppler signals in
order to emphasize the contributions of various velocities. Finally, we
use cepstral coefficients to represent the patterns in the spectrogram
[11]. We use a 62 ms Hamming window with 75% overlap. The 80-dimensional
feature vector consists of cepstral coefficients and their deltas.

4.  METHODS 

4.1.  Gaussian  Mixture  Model  Classifiers 

The motivation for using Gaussian mixture densities is that a sufficiently
large linear combination of Gaussian basis functions is capable of
representing any differentiable sample distribution [20, 21].

A Gaussian mixture density is a weighted sum of $M$ component densities,
as shown in the following equation,

$$ p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(\mathbf{x}) \tag{2} $$

where $\mathbf{x}$ is a $D$-dimensional random vector, $b_i(\mathbf{x})$,
$i = 1, \dots, M$, are the component densities, and $p_i$,
$i = 1, \dots, M$, are the mixture weights. Each component density is a
$D$-variate Gaussian function of the form

$$ b_i(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \mu_i)^T \Sigma_i^{-1} (\mathbf{x} - \mu_i) \right\} \tag{3} $$

with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. The mixture
weights are constrained by $\sum_{i=1}^{M} p_i = 1$. The complete Gaussian
mixture density is parameterized by the mean vectors, covariance matrices
(we use diagonal covariance matrices here), and mixture weights from all
component densities. These parameters are collectively represented by the
notation $\lambda = \{p_i, \mu_i, \Sigma_i\}$, $i = 1, \dots, M$. For
classification, each class is represented by a GMM parameterized by
$\lambda$.

Given training data from each class, the goal of model training is to
estimate the parameters of the GMM. Maximum likelihood model parameters
are estimated using the Expectation-Maximization (EM) algorithm.
Generally, ten iterations are sufficient for parameter convergence.

The objective is to find the class model that has the maximum a posteriori
probability for a given observation sequence $X$. Assuming equal prior
probability for all $N$ classes (i.e., $p(\lambda_k) = 1/N$), the
classification rule simplifies to

$$ \hat{N} = \arg\max_{1 \le k \le N} p(X \mid \lambda_k) = \arg\max_{1 \le k \le N} \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_k) \tag{4} $$

where the second equality uses logarithms and the independence between
observations, and $T$ is the number of observations.
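Equations (2) to (4) can be illustrated with a minimal sketch of diagonal-covariance GMM scoring. The model parameters below are hand-picked toy values rather than EM estimates, and the representation of a GMM as a list of `(weight, mean, variance)` tuples is an assumption of this example.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log of Equation (3) for a diagonal covariance (var holds the diagonal)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)


def log_gmm(x, gmm):
    """Log of Equation (2); gmm is a list of (weight, mean, var) tuples.
    Uses the log-sum-exp trick for numerical stability."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def classify(frames, class_models):
    """Equation (4): sum frame log-likelihoods and pick the best class."""
    scores = {name: sum(log_gmm(x, gmm) for x in frames)
              for name, gmm in class_models.items()}
    return max(scores, key=scores.get), scores


# Toy one-dimensional models with hand-picked (not trained) parameters.
models = {
    "humans_only": [(1.0, [0.0], [1.0])],
    "humans_with_animals": [(0.5, [3.0], [1.0]), (0.5, [5.0], [1.0])],
}
label, scores = classify([[0.1], [-0.2], [0.3]], models)
```

Summing log-likelihoods over the $T$ frames is exactly the second form of Equation (4); working in the log domain avoids underflow when $T$ is large.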

4.2.  Decision  Fusion 

GMMs are trained for each modality and their log probabilities are
combined as

$$ S_\lambda(\mathbf{x}) = \sum_{m \in \mathcal{M}} w_m \log P(\mathbf{x}_m \mid \lambda) \tag{5} $$

where $\mathcal{M} = \{a, s, u\}$, and $a$, $s$, $u$ represent the
acoustic, seismic, and ultrasound modalities, respectively. If all
likelihood functions were correctly trained, and if the vectors
$\mathbf{x}_a$, $\mathbf{x}_s$, and $\mathbf{x}_u$ were conditionally
independent given the class label, then the Bayes-optimal mode weights
would be $w_m = 1$. In practice the likelihood functions tend to be
overconfident; therefore, we scale them using $0 \le w_m \le 1$,
$\sum_{m \in \mathcal{M}} w_m = 1$.

For simplicity, we choose weights by a grid search of global weights on
validation sets [22]. Note that Equation (5) corresponds to a linear
combination in the log-likelihood domain; however, it does not represent a
probability distribution in general, and will be referred to as a score.
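The score of Equation (5) and the grid search over weights can be sketched as below. The grid step, the inclusion of the endpoints 0 and 1, the tie-breaking behavior, and the toy validation data are assumptions made for illustration; the per-modality log-likelihoods would in practice come from the trained GMMs.

```python
from itertools import product


def fused_score(log_likelihoods, weights):
    """Equation (5): weighted sum of per-modality log-likelihoods."""
    return sum(weights[m] * log_likelihoods[m] for m in weights)


def fuse_and_classify(per_class_ll, weights):
    """Pick the class with the highest fused score.
    per_class_ll maps class name -> {modality: log-likelihood}."""
    return max(per_class_ll,
               key=lambda c: fused_score(per_class_ll[c], weights))


def weight_grid(step=0.1):
    """Yield weights for modalities a, s, u on a grid, summing to one."""
    n = round(1.0 / step)
    for i, j in product(range(n + 1), repeat=2):
        if i + j <= n:
            yield {"a": i / n, "s": j / n, "u": (n - i - j) / n}


def grid_search(validation, step=0.1):
    """Return the first weight setting with the best validation accuracy."""
    best_w, best_acc = None, -1.0
    for w in weight_grid(step):
        correct = sum(fuse_and_classify(ll, w) == label
                      for ll, label in validation)
        acc = correct / len(validation)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


# Toy validation set: the seismic scores are reliable, the acoustic ones are not.
validation = [
    ({"H": {"a": -10.0, "s": -1.0, "u": -5.0},
      "A": {"a": -1.0, "s": -10.0, "u": -5.0}}, "H"),
    ({"H": {"a": -1.0, "s": -10.0, "u": -5.0},
      "A": {"a": -10.0, "s": -1.0, "u": -5.0}}, "A"),
]
best_weights, best_accuracy = grid_search(validation)
```

In this toy setting the search settles on weights that favor the seismic modality over the acoustic one, which is the kind of rebalancing the global weights are meant to provide.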

4.3.  Support  Vector  Machines 

A Support Vector Machine (SVM) estimates decision surfaces,
$g(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b$, directly [23], rather
than modeling a probability distribution from the training data. Given
training feature vectors $\mathbf{x}_i \in \mathbb{R}^n$,
$i = 1, \dots, k$, in two classes with labels $y_i \in \{1, -1\}$,
$i = 1, \dots, k$, an SVM solves the following optimization problem:

$$ \min_{\mathbf{w}, b, \xi} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{k} \xi_i $$

$$ \text{subject to} \quad y_i \left( \mathbf{w}^T \phi(\mathbf{x}_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, k $$

where $\phi(\mathbf{x}_i)$ maps $\mathbf{x}_i$ onto a higher dimensional
space, $C > 0$ is the regularization parameter, and $\xi_i$ is a slack
variable, which measures the degree of misclassification of the datum
$\mathbf{x}_i$.

The solution $\mathbf{w}$ satisfies
$\mathbf{w} = \sum_{i=1}^{k} y_i \alpha_i \phi(\mathbf{x}_i)$, where
$0 \le \alpha_i \le C$, $i = 1, \dots, k$, and the decision function is

$$ h(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{k} y_i \alpha_i K(\mathbf{x}_i, \mathbf{x}) + b \right) \tag{6} $$

where $K(\mathbf{x}_i, \mathbf{x}) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$
is the kernel function. In this paper, we use LIBSVM with Radial Basis
Function (RBF) kernels, that is,
$K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$
[24].
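The decision function of Equation (6) with an RBF kernel can be evaluated with a few lines of code. This sketch does not call LIBSVM and does not solve the optimization problem; the support vectors, multipliers, bias, and $\gamma$ below are hand-picked toy values used only to show how a trained model would be applied.

```python
import math


def rbf_kernel(xi, xj, gamma):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))


def svm_decision(x, support_vectors, labels, alphas, bias, gamma):
    """Equation (6): sign of the kernel expansion over the support vectors."""
    total = sum(y * a * rbf_kernel(sv, x, gamma)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if total + bias >= 0 else -1


# Hand-picked toy model (not the output of any training run).
support_vectors = [[0.0], [2.0]]
labels = [1, -1]
alphas = [1.0, 1.0]
near_positive = svm_decision([0.1], support_vectors, labels, alphas,
                             bias=0.0, gamma=1.0)
near_negative = svm_decision([1.9], support_vectors, labels, alphas,
                             bias=0.0, gamma=1.0)
```

Only training vectors with nonzero $\alpha_i$ contribute to the sum, which is why the model can be stored as a list of support vectors.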

5.  EXPERIMENTS 

In this section, we describe three experiments in order to compare our
proposed methods with previous approaches in classifying humans only vs.
humans with four-legged animals. There are 69 recordings in the dataset.
We divide the recordings into four groups and choose two for training and
two for testing at a time, resulting in a six-fold cross-validation. In
each fold, we randomly select a part of the recordings from the training
and testing sets as a validation set. We choose the best mixture count for
the GMM classifier and the parameters $\gamma$ and $C$ for the SVM
according to the validation set. The experimental results are reported as
mean ± standard error.
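The six folds follow from choosing two of the four groups for training, since $\binom{4}{2} = 6$. The sketch below enumerates the splits and computes a mean and standard error; the use of the sample standard deviation (division by $n - 1$) is an assumption, because the paper does not state which estimator it uses, and the function names are hypothetical.

```python
import math
from itertools import combinations


def two_vs_two_folds(num_groups=4):
    """Enumerate (train_groups, test_groups) with two groups for training."""
    folds = []
    for train in combinations(range(num_groups), 2):
        test = tuple(g for g in range(num_groups) if g not in train)
        folds.append((train, test))
    return folds


def mean_and_standard_error(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


folds = two_vs_two_folds()
mean, stderr = mean_and_standard_error([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

Each of the six entries in `folds` pairs two training groups with the two remaining test groups, so every group is used for training in three folds and for testing in three folds.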

5.1.  Seismic  features 

As described in Section 3.1, we compare our gait pattern features based on
enhanced summary autocorrelation with the temporal gait pattern [5] under
the same experimental setup. The experimental results are shown in Table
1.


Feature                                     GMM accuracy (%)   SVM accuracy (%)
Temporal gait pattern [5]                   71.883 ± 4.607     79.010 ± 4.648
Enhanced summary autocorrelation pattern    81.707 ± 2.564     84.446 ± 2.868

Table 1. Classification accuracy using seismic features.


From the experimental results of Table 1, our proposed method using the
enhanced summary autocorrelation pattern outperforms the previous method
[5] with both GMM and SVM classifiers, because the previous method did not
consider the case of multiple objects. Compared with GMM classifiers [5],
the experimental results show that the SVM discriminates better between
the two classes for seismic features.


5.2.  Acoustic  features 

As described in Section 3.2, we want to examine the effect of using (1)
spectral subtraction, (2) seismic peaks with different values of $\delta$,
and (3) our proposed multi-stage exemplar selection framework using GMM
and SVM classifiers as the first step of the algorithm. The experimental
results are shown in Table 2.

The first row of Table 2, PLP features without (1)(2)(3)(4), represents
using the active audio segments, without using the duration estimated by
the peaks of seismic signals, and without using spectral subtraction.
Spectral subtraction (row 2) improves the performance for both
classifiers.

It is helpful to further extract audio features from the time intervals
marked by peaks of seismic signals. This method utilizes the
characteristics of both the acoustic and seismic sensors in the sensor
suites. Without this method, there are many silent or noisy segments in
the audio signals, and these segments leave both classifiers poorly
trained.

Moreover, different values of $\delta$ capture different amounts of
acoustic information. The results show that $\delta = 0.3$ s has the best
performance compared with $\delta = 0.1$ s and $\delta = 0.5$ s. The
seismic sensor and acoustic sensor are not at exactly the same place and
the rates of propagation are different. Therefore, there are asynchronies
between acoustic and seismic signals. Specifically, with $\delta = 0.1$ s,
the acoustic segment does not contain the entire footstep sound. On the
other hand, with $\delta = 0.5$ s, the acoustic signals include too much
unrelated noise. These reasons may explain the performance variation of
both classifiers.

For our proposed multi-stage exemplar selection framework, using a GMM for
exemplar selection improves the accuracy by around 1~2% for GMM
classifiers; on the contrary, using a GMM for exemplar selection degrades
the accuracy for SVM classifiers. A possible reason is that the SVM
implicitly chooses support vectors for the hyperplane in the feature
space. Using GMM-selected features, the SVM has less information, and
hence has worse performance. On the other hand, using an SVM for exemplar
selection degrades performance in all cases. A possible explanation is
that the SVM cannot select proper exemplars in the first stage when the
feature spaces overlap.

Feature                                           GMM accuracy (%)   SVM accuracy (%)
PLP features without (1)(2)(3)(4)                 73.768 ± 2.230     65.337 ± 1.896
PLP features with (1)                             76.105 ± 4.098     71.698 ± 4.572
PLP features with (1)(2), $\delta$ = 0.1 s        74.975 ± 5.079     78.093 ± 1.699
PLP features with (1)(2)(3), $\delta$ = 0.1 s     75.737 ± 2.936     76.604 ± 2.179
PLP features with (1)(2)(4), $\delta$ = 0.1 s     72.735 ± 4.585     75.090 ± 2.577
PLP features with (1)(2), $\delta$ = 0.3 s        77.555 ± 4.268     80.578 ± 3.113
PLP features with (1)(2)(3), $\delta$ = 0.3 s     79.015 ± 3.799     72.638 ± 2.727
PLP features with (1)(2)(4), $\delta$ = 0.3 s     75.325 ± 3.739     77.196 ± 1.706
PLP features with (1)(2), $\delta$ = 0.5 s        75.392 ± 3.376     76.214 ± 4.396
PLP features with (1)(2)(3), $\delta$ = 0.5 s     77.688 ± 3.149     74.507 ± 3.634
PLP features with (1)(2)(4), $\delta$ = 0.5 s     74.800 ± 4.523     71.313 ± 3.456

Table 2. Classification accuracy using acoustic features, where (1)
represents spectral subtraction, (2) represents the use of seismic peaks
with different values of $\delta$ in seconds (s), (3) represents the use
of our proposed multi-stage exemplar selection framework using a GMM
classifier as the first step of the algorithm, and (4) represents the use
of our proposed multi-stage exemplar selection framework using an SVM
classifier as the first step of the algorithm.

5.3.  Decision  fusion  and  feature  fusion  with  seismic, 
acoustic,  and  ultrasonic  features 

We perform multimodal fusion in a classifier-dependent manner: decision
fusion with GMMs, and feature fusion (vector concatenation) with the SVM.
Note that, for ultrasonic data, within 186 ms there are eight moving
windows, resulting in a 640-dimensional feature vector. We use principal
component analysis (PCA), keeping 99% of the energy, to reduce the
features to 7 dimensions.

We compare our proposed methods using GMM and SVM classifiers, as shown in
Table 3. Row 1 of Table 3 represents the use of ultrasonic features, the
enhanced summary autocorrelation pattern, PLP features with spectral
subtraction, seismic peaks with $\delta = 0.3$ s, and the multi-stage
exemplar selection framework using GMM classifiers. Row 2 of Table 3
represents the use of the same seismic and ultrasonic features as Row 1,
and acoustic features without the multi-stage exemplar selection. Row 3 of
Table 3 represents the use of the temporal gait pattern [5], PLP features
without spectral subtraction, using the whole active segments, and without
the multi-stage exemplar selection. Row 4 of Table 3 represents the use of
ultrasonic features only.

In Table 3, our proposed method, using seismic and acoustic features along
with ultrasonic features, greatly improves the robustness compared with
previous approaches. With the exemplar selection framework, GMM
classifiers achieve the best fusion accuracy. The SVM, however, performs
worse with exemplar selection, as mentioned above. The classification
task, using only ultrasonic features (last row), is roughly 7% better with
SVM classifiers compared with GMM classifiers.

Feature      GMM accuracy (%)   SVM accuracy (%)
(1)(3)(5)    86.092 ± 2.313     84.446 ± 2.868
(1)(2)(5)    84.928 ± 2.790     85.307 ± 3.405
(4)(5)       81.903 ± 3.144     81.041 ± 1.754
(5)          75.528 ± 3.564     82.188 ± 3.466

Table 3. Classification accuracy using decision fusion (GMM classifier)
and feature fusion (SVM classifier), where (1) represents the enhanced
summary autocorrelation pattern, (2) represents PLP features with spectral
subtraction and seismic peaks with $\delta = 0.3$ s, (3) represents (2)
with the multi-stage exemplar selection framework using a GMM classifier
as the first step of the algorithm, (4) represents the use of the temporal
gait pattern [5], PLP features without spectral subtraction, using the
whole active segments, and without the multi-stage exemplar selection, and
(5) represents ultrasonic features.

We analyze the errors of the (1)(3)(5) configuration in the GMM decision
fusion case. Across the six cross-validation folds, the recordings of the
event seven people with a dog are all incorrectly classified as humans
only. This accounts for 52.6% of all errors. A possible explanation is
that dogs have padded feet (instead of hoofs) and are relatively small. It
is difficult to tell dogs from humans because the classifier has learned
to recognize hoof sounds. The limited amount of data for this event means
that the classifier is unable to learn its distinctive pattern.

6.  CONCLUSION 

In this paper, we use a challenging realistic multi-sensor multi-modal
dataset for personnel detection. Based on the phenomenology of the
differences (gait pattern, footstep sound, and micro-Doppler motion)
between humans and four-legged animals, we propose a new seismic feature
extraction method based on enhanced summary autocorrelation, a multi-stage
acoustic exemplar selection framework, and temporal patterns from
ultrasonic sensors. Experimental results show that the combination of
multi-modal sensors improves the robustness of the system over previous
approaches. Since it is inexpensive to deploy unattended ground sensors
such as acoustic, seismic, and ultrasonic sensors in target areas, it is
possible to further extend the current fusion system to create a tracking
system based on sensor network fusion.